Abstract

We consider the question of averaging on a graph that has one sparse cut separating 
two subgraphs that are internally well connected. While there has been a large body 
of work devoted to algorithms for distributed averaging, nearly all algorithms involve 
only convex updates. In this paper, we suggest that non-convex updates can lead to 
significant improvements. We do so by exhibiting a decentralized algorithm for graphs 
with one sparse cut that uses non-convex averages and has an averaging time that can 
be significantly smaller than the averaging time of known distributed algorithms, such 
as those of [3, 2]. We use stochastic dominance to prove this result in a way that may
be of independent interest. 



1 Introduction 

Consider a graph $G = (V, E)$, where i.i.d. Poisson clocks with rate 1 are associated with each
edge^1. We represent the "true" real-valued time by $T$. Each node $v_i$ holds a value $x_i(T)$ at
time $T$. Let the average value held by the nodes be $x_{av}$. Every time an edge $e = (v, w)$ ticks,
it updates the values of the vertices adjacent to it on the basis of present and past values of $v$, $w$
and their immediate neighbors according to some algorithm $\mathcal{A}$. There is an extensive body of
work surrounding the subject of gossip algorithms in various contexts. Non-convex updates
have been used in the context of a second-order diffusion for load balancing [5] in a slightly
different setting. The idea there was to take into account the values of the nodes during the
previous two time steps rather than just the previous one (in a synchronous setting), and
set the future value of a node to a non-convex linear combination of the past values of some
of its neighbors. There is also a line of research on averaging algorithms having two time
scales [1, 4], which is closely related to the present paper.

^1 This model can be simulated using previous models such as [2] by allocating edges to nodes and equipping
nodes with multiple i.i.d. Poisson clocks.

In a previous paper [6], we considered the use of non-convex combinations for gossip on a
geographic random graph on $n$ nodes. There we showed that one can achieve averaging
using $n^{1+o(1)}$ updates if one is willing to allow a certain amount of centralized control. The
main technical difficulty in using non-convex updates is that they can skew the values held
by nodes in the short term. We show that nonetheless, in the long term this leads to faster
averaging. Let the values held by the nodes be $X(T) = (x_1(T), \dots, x_{|V|}(T))^T$. We study
distributed averaging algorithms $\mathcal{A}$ which result in

$$\lim_{T \to \infty} X(T) = x_{av} \mathbf{1},$$

where $x_{av}$ is invariant under the passage of time, and show that in some cases there is an
exponential speed-up in $n$ if one allows the use of non-convex updates, as opposed to only
convex ones.



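Definition 1 Let

$$\mathrm{var}\, X(t) := \frac{1}{|V|} \sum_{i=1}^{|V|} \left(x_i(t) - x_{av}\right)^2.$$

Let

$$T_{av} = \sup_{x \in \mathbb{R}^{|V|}} \inf \left\{ t : \mathbb{P}\left[ \exists\, T > t,\ \frac{\mathrm{var}\, X(T)}{\mathrm{var}\, X(0)} > \frac{1}{e^2} \,\middle|\, X(0) = x \right] < \frac{1}{e} \right\}.$$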



Notation 1 Let a connected graph $G = (V, E)$ have a partition into connected graphs $G_1 =
(V_1, E_1)$ and $G_2 = (V_2, E_2)$. Specifically, every vertex in $V$ is either in $V_1$ or $V_2$, and every
edge in $E$ belongs to either $E_1$ or to $E_2$, or to the set of edges $E_{12}$ that have one endpoint
in $V_1$ and one in $V_2$. Let $|V_1| = n_1$, $|V_2| = n_2$ where without loss of generality, $n_1 \le n_2$ and
$|V| = n$. Let $T_{van}(G_1)$ and $T_{van}(G_2)$ be the averaging times of the "vanilla" algorithm that
replaces at the clock tick of an edge $e$ the values of the endpoints of $e$ by the arithmetic mean
of the two, applied to $G_1$ and $G_2$ respectively.

Definition 2 Let $\mathcal{C}$ denote the set of algorithms that use only convex updates of the form

1. $x_i(t^+) = \alpha x_i(t^-) + \beta x_j(t^-)$.

2. $x_j(t^+) = \alpha x_j(t^-) + \beta x_i(t^-)$.

where $\alpha \in [0, 1]$ and $\alpha + \beta = 1$.



These updates have been extensively studied, see for example [3, 2].
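The convex updates of Definition 2 can be illustrated by a small simulation. The following is a minimal sketch, not part of the original analysis: the function names `convex_gossip` and `variance` are hypothetical, and the Poisson clocks are simulated by the standard observation that the superposition of $|E|$ independent rate-1 clocks ticks at rate $|E|$ on a uniformly random edge. Setting `alpha = 0.5` gives the vanilla algorithm.

```python
import random


def convex_gossip(edges, x0, alpha, t_end, seed=0):
    """Run symmetric convex updates with rate-1 Poisson edge clocks until t_end."""
    rng = random.Random(seed)
    x = list(x0)
    beta = 1.0 - alpha  # alpha + beta = 1, alpha in [0, 1]
    t = 0.0
    while True:
        # Next tick of the superposed clocks: Exp(|E|) waiting time,
        # and the ticking edge is uniform over all edges.
        t += rng.expovariate(len(edges))
        if t > t_end:
            return x
        i, j = rng.choice(edges)
        xi, xj = x[i], x[j]
        x[i] = alpha * xi + beta * xj
        x[j] = alpha * xj + beta * xi


def variance(x):
    """Empirical variance of the node values about their mean."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)
```

Because each update preserves $x_i + x_j$ and contracts $x_i - x_j$ by the factor $|\alpha - \beta| \le 1$, the sum of the values is invariant and the variance never increases along any run of this sketch.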
Theorem 1 The averaging time of any distributed algorithm in $\mathcal{C}$ is $\Omega\left(\frac{\min(|V_1|, |V_2|)}{|E_{12}|}\right)$.

Theorem 2 The averaging time of $\mathcal{A}$ is $O(\log n\, (T_{van}(G_1) + T_{van}(G_2)))$.

Note that in the case where $G_1$ and $G_2$ are sufficiently well connected internally but poorly
connected to each other, $\mathcal{A}$ outperforms any algorithm in $\mathcal{C}$. In fact, for the graph $G'$
obtained by joining two complete graphs $G_1'$, $G_2'$, each having $\frac{n}{2}$ vertices, by a single edge,
the bound of Theorem 1 is $\Omega(n)$, while $O(\log n\, (T_{av}(G_1') + T_{av}(G_2'))) = O(\log n)$.



1.0.1 Algorithm A

Let the vertices of $G_1$ be labeled by $[n_1]$ and those of $G_2$ by $[n] \setminus [n_1]$, where $[n] := \{1, \dots, n\}$.
Let $e_c = (v_{n_1}, v_{n_1+1})$ be a fixed edge belonging to $E_{12}$. Let the time of the $k$-th clock tick of
an edge $e$ be $t$. Let $C \gg 1$ be a sufficiently large absolute constant (independent of $n$).

• If the edge $e$ is $e_c = (v_{n_1}, v_{n_1+1})$,

1. If $k \equiv -1 \mod \lceil C(T_{van}(G_1) + T_{van}(G_2)) \ln n \rceil$,

(a) $x_{n_1}(t^+) = x_{n_1}(t^-) + n_1 \{x_{n_1+1}(t^-) - x_{n_1}(t^-)\}$

(b) $x_{n_1+1}(t^+) = x_{n_1+1}(t^-) - n_1 \{x_{n_1+1}(t^-) - x_{n_1}(t^-)\}$

2. If $k \not\equiv -1 \mod \lceil C(T_{van}(G_1) + T_{van}(G_2)) \ln n \rceil$, make no update.

• If the edge $e = (v, w) \notin E_{12}$, replace the values of $v$ and $w$ by their arithmetic mean, as in the vanilla algorithm.

• If $e \in E_{12} \setminus \{e_c\}$, make no update.
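The case analysis above can be sketched in code. This is an illustrative sketch under stated assumptions rather than a definitive implementation: vertices are 0-indexed so that $G_1$ is `0..n1-1` and $e_c$ is `(n1 - 1, n1)`, the gain in steps (a) and (b) is read as $n_1$, edges inside $G_1$ or $G_2$ are assumed to perform vanilla averaging, and the names `make_period` and `tick` are hypothetical.

```python
import math


def make_period(C, t_van_1, t_van_2, n):
    """ceil(C * (T_van(G_1) + T_van(G_2)) * ln n), the spacing of non-convex steps."""
    return math.ceil(C * (t_van_1 + t_van_2) * math.log(n))


def tick(x, edge, k, n1, period):
    """Apply in place the update for the k-th clock tick (k >= 1) of `edge`."""
    i, j = sorted(edge)
    a, b = n1 - 1, n1  # 0-indexed endpoints of the distinguished cut edge e_c
    if (i, j) == (a, b):
        if k % period == period - 1:  # k = -1 (mod period)
            d = x[b] - x[a]
            x[a] += n1 * d  # step (a): non-convex, overshoots past x[b]
            x[b] -= n1 * d  # step (b): opposite change, so the sum is kept
        # otherwise: no update on e_c
    elif (i < n1) == (j < n1):
        # Both endpoints in G_1 or both in G_2: assumed vanilla averaging.
        m = (x[i] + x[j]) / 2
        x[i] = x[j] = m
    # Any other cut edge: no update.
```

Every branch preserves the sum of the node values, which is the property that keeps $x_{av}$ invariant in time; the non-convex branch is the only one that can move values outside the range of the current values.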



2 Limitations of convex combinations 

Given a function $a(t)$, let its right limit at $t$ be denoted by $a(t^+)$ and its left limit at $t$ by
$a(t^-)$. Consider an algorithm $C \in \mathcal{C}$.

Let us consider the initial condition where $X(0)$ is the vector that is $1$ on vertices $v_1, \dots, v_{n_1}$
of $G_1$ and $-\frac{n_1}{n_2}$ on vertices $v_{n_1+1}, \dots, v_n$ of $G_2$. Let us denote $\frac{\sum_{i=1}^{n_1} x_i(t)}{n_1}$ by $y(t)$ and $\frac{\sum_{i=n_1+1}^{n} x_i(t)}{n_2}$
by $z(t)$. In the model we have considered, with probability 1, at no time does more than one
clock tick.

In the course of the execution of any algorithm in $\mathcal{C}$, $y(t)$ can change only during clock ticks of
edges in $E_{12}$, and the same holds for $z(t)$. This is because during a clock tick of any other edge, both
of whose end-vertices lie in $G_1$ or in $G_2$, $y(t)$ and $z(t)$ do not change. The values at the endpoints
of an edge in $E_{12}$ can change by at most 2 across these instants, because all values $x_i(t)$
are seen to lie in the interval

$$\left[\min_{1 \le i \le |V|} x_i(0),\ \max_{1 \le i \le |V|} x_i(0)\right] \subseteq [-1, 1].$$

If the clock of an edge in $E_{12}$ ticks at time $t$, we therefore find that

$$|y(t^+) - y(t^-)| \le \frac{2}{n_1}. \quad (1)$$

The number of clock ticks of $e_c$ until time $t$ is a Poisson random variable whose mean is $t$.
A direct calculation tells us that

$$\mathrm{var}(X(t)) \ge \frac{n_1 y(t)^2 + n_2 z(t)^2}{n}. \quad (2)$$

To obtain a lower bound for $y(t)^2$, we note that the total number of times the clocks of
edges belonging to $E_{12}$ tick is a Poisson random variable $\nu_t$ with mean $t|E_{12}|$. It follows from
Inequality (1) that $y(t) \ge 1 - \frac{2\nu_t}{n_1}$.



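$$|E_{12}|\, T_{av} = \mathbb{E}[\nu_{T_{av}}] \ge \mathbb{P}\left[\nu_{T_{av}} \ge \left(1 - \frac{1}{e}\right)\frac{n_1}{4}\right] \left(1 - \frac{1}{e}\right)\frac{n_1}{4}.$$

However, $\mathbb{P}\left[\nu_{T_{av}} \ge \left(1 - \frac{1}{e}\right)\frac{n_1}{4}\right]$ must be large, because otherwise $y(T_{av})$ would probably be large. More precisely,

$$\mathbb{P}\left[\nu_{T_{av}} \ge \left(1 - \frac{1}{e}\right)\frac{n_1}{4}\right] \ge 1 - \mathbb{P}\left[\exists\, T > T_{av},\ \mathrm{var}\, X(T) > \frac{\mathrm{var}\, X(0)}{e^2}\right] \ge 1 - \frac{1}{e}.$$

Therefore,

$$T_{av} \ge \mathbb{P}\left[\nu_{T_{av}} \ge \left(1 - \frac{1}{e}\right)\frac{n_1}{4}\right] \left(1 - \frac{1}{e}\right)\frac{n_1}{4|E_{12}|} \ge \left(1 - \frac{1}{e}\right)^2 \frac{n_1}{4|E_{12}|}.$$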
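The pathwise bound $y(t) \ge 1 - \frac{2\nu_t}{n_1}$ used in this section can be checked numerically. The sketch below is only an illustration under stated assumptions: $G_1$ and $G_2$ are taken to be complete graphs joined by the single cut edge `(n1 - 1, n1)` (0-indexed), only the order of the clock ticks is simulated (with rate-1 clocks each tick falls on a uniformly random edge), and the function name `cut_tick_bound_holds` is hypothetical.

```python
import random


def cut_tick_bound_holds(n1, n2, alpha, ticks, seed=0):
    """Check y >= 1 - 2*nu/n1 after every tick of a symmetric convex algorithm."""
    rng = random.Random(seed)
    n = n1 + n2
    # Initial condition of the lower-bound argument: 1 on G_1, -n1/n2 on G_2.
    x = [1.0] * n1 + [-n1 / n2] * n2
    internal = [(i, j) for i in range(n1) for j in range(i + 1, n1)]
    internal += [(i, j) for i in range(n1, n) for j in range(i + 1, n)]
    cut = (n1 - 1, n1)
    edges = internal + [cut]
    beta = 1.0 - alpha
    nu = 0  # number of cut-edge ticks so far
    for _ in range(ticks):
        i, j = rng.choice(edges)
        if (i, j) == cut:
            nu += 1
        xi, xj = x[i], x[j]
        x[i], x[j] = alpha * xi + beta * xj, alpha * xj + beta * xi
        y = sum(x[:n1]) / n1  # mean over G_1
        if y < 1 - 2 * nu / n1 - 1e-9:
            return False
    return True
```

The check succeeds on every sample path, not just with high probability, since internal ticks leave $y$ unchanged and each cut-edge tick moves a single value in $V_1$ by at most 2.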
3 Using non-convex combinations



3.0.2 Analysis 

Since $T_{av}$ is defined in terms of variance and algorithm $\mathcal{A}$ uses only linear updates, we may
subtract out the mean from each $x_i(0)$, and it is sufficient to analyze the case when $x_{av} = 0$.

Let $V_1 = [n_1]$ and $V_2 = [n] \setminus [n_1]$. Let $\mu_1(t) = \frac{\sum_{i \in V_1} x_i(t)}{n_1}$ and $\mu_2(t) = \frac{\sum_{i \in V_2} x_i(t)}{n_2}$ and $\mu(t) =
|\mu_1(t)| + |\mu_2(t)|$. Let




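$$\sigma(t) = \sqrt{\frac{\sum_{i \in V_1} (x_i(t) - \mu_1(t))^2 + \sum_{i \in V_2} (x_i(t) - \mu_2(t))^2}{n}}.$$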



We consider time instants $T_1, T_2, \dots$ where $T_i$ is the instant at which the clock of edge $e_c$
ticks for the $i \lceil C(T_{van}(G_1) + T_{van}(G_2)) \ln n \rceil$-th time. Observe that the value of $\mu(t)$ changes
only across time instants $T_k$, $k = 1, 2, \dots$.

The amount by which $x_{n_1}(t)$ and $x_{n_1+1}(t)$ deviate from $\mu_1(t)$ and $\mu_2(t)$ respectively can be
seen to be bounded above by $\sqrt{n}\,\sigma(t)$:

$$\max \left\{ |x_{n_1}(t) - \mu_1(t)|,\ |x_{n_1+1}(t) - \mu_2(t)| \right\} \le \sqrt{n}\,\sigma(t). \quad (3)$$



We now examine the evolution of $\sigma(T_k^+)$ and $\mu(T_k^+)$ as $k \to \infty$. The statements below are
true if $C$ is a sufficiently large universal constant (independent of $n$).



From $T_k^+$ to $T_{k+1}^-$, independent of $x$,



P 



n" 



X(T+) = x 



i 

< — 

An 



Because of inequality (3), from $T_{k+1}^-$ to $T_{k+1}^+$,

ff(2ft.i)<nWlw.i) + IM^i)i: 
MT+ +1 )\<nl a(T k+1 ) 

$$\mathrm{var}\, X(t) = \mu(t)^2 + \sigma(t)^2.$$



We deduce from the above that

$$\mathbb{P}\left[\frac{\mathrm{var}\, X(T_{k+1}^+)}{\mathrm{var}\, X(T_k^+)} > \frac{1}{n^4}\right] \le \frac{1}{4n}. \quad (8)$$
Let $A_k$ be the (random) operator obtained by composing the linear updates from time $T_k^+$
to $T_{k+1}^+$. Let $\|A\|$ denote the norm of an operator acting from $\ell_2$ to $\ell_2$:

$$\|A\| = \sup_{x \in \mathbb{R}^n} \frac{\|Ax\|_2}{\|x\|_2}.$$



Lemma 1

$$\mathbb{P}\left[\|A_k\|_2 \ge \frac{1}{n^{3/2}}\right] \le \frac{1}{2}. \quad (9)$$



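To see this, let $v_1, \dots, v_n$ be the canonical basis for $\mathbb{R}^n$. For any unit vector $x$, write

$$x = \sum_{i=1}^{n} \lambda_i v_i.$$

Then,

$$\|A_k(x)\| \le \sum_{i=1}^{n} |\lambda_i|\, \|A_k(v_i)\| \quad \text{(Triangle inequality)} \quad (10)$$

$$\le \sqrt{\sum_{i=1}^{n} \|A_k(v_i)\|^2} \quad \text{(Cauchy-Schwarz inequality)} \quad (11)$$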



The Lemma now follows from Inequality (8) by an application of the Union Bound. □

Moreover, we observe by construction that the norm of $A_k$ is less than or equal to $n$,

$$\|A_k\| \le n. \quad (12)$$
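The step from (10) to (11) bounds the operator norm by the root of the summed squared norms of the images of the basis vectors. A small numerical sanity check of that step is sketched below; it is an illustration only, the helper names (`apply`, `norm`, `column_bound`, `check`) are hypothetical, and the test matrix in the usage note is an arbitrary example rather than an operator $A_k$ from the paper.

```python
import math
import random


def apply(A, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(c * c for c in v))


def column_bound(A):
    """sqrt(sum_i ||A v_i||^2), where A v_i is the i-th column of A."""
    n = len(A[0])
    cols = [[row[i] for row in A] for i in range(n)]
    return math.sqrt(sum(norm(c) ** 2 for c in cols))


def check(A, trials=1000, seed=0):
    """Test ||A x|| <= column_bound(A) on random unit vectors x."""
    rng = random.Random(seed)
    n = len(A[0])
    bound = column_bound(A)
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(n)]
        s = norm(x)
        x = [c / s for c in x]  # normalise to a unit vector
        if norm(apply(A, x)) > bound + 1e-9:
            return False
    return True
```

For instance, `check([[-2, 3], [3, -2]])` exercises a two-vertex update matrix of the non-convex form $x \mapsto x + 3(x' - x)$; the bound holds for any matrix, which is why a high-probability bound on each $\|A_k(v_i)\|$ suffices for the Lemma.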



Note that $\log(\mathrm{var}\, X(T_k^+))$ defines a random process (that is not Markov). The updates $A_k$
from time $T_k^+$ to $T_{k+1}^+$ for successive $k$ are i.i.d. random operators acting on $\mathbb{R}^n$. Note that

$$\log(\mathrm{var}\, X(T_k^+)) - \log(\mathrm{var}\, X(0)) \le \sum_{i=1}^{k} \log \|A_i\|$$

due to the presence of the supremum in the definition of operator norm.

$$\hat{W}_k := \sum_{i=1}^{k} \log \|A_i\|$$

is a random walk on the real line for $k = 1, \dots, \infty$.

The last and perhaps most important ingredient is that of stochastic dominance. It follows
from Lemma 1 and Equation (12) that the random walk $\{\hat{W}_k\}$ can be coupled with a random
walk $\{W_k\}$ that is always to the right of it on the real line, i.e. for all $k$, $\hat{W}_k \le W_k$, where
the increments

$$W_{k+1} - W_k = \log n \quad \left(\text{with probability } \tfrac{1}{2}\right) \quad (13)$$

$$= -\frac{3}{2} \log n \quad \left(\text{with probability } \tfrac{1}{2}\right). \quad (14)$$

Noting that by construction,

$$\log(\mathrm{var}\, X(T_k^+)) - \log(\mathrm{var}\, X(0)) \le W_k, \quad (15)$$

it follows that $T_{av}$ is upper bounded by any $t$ which satisfies



$$\mathbb{P}\left[\forall\, T > t,\ W_T < -2\right] > 1 - \frac{1}{e}.$$

Note that $\mathbb{E}[W_k] = -\frac{k}{4} \log n$ and $\mathrm{var}[W_k] = \frac{25k}{16} \log^2 n$.

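In order to proceed, we shall need the following inequality about the simple unbiased random
walk $\{S_k\}_{k \ge 0}$ on $\mathbb{Z}$ starting at 0.

Theorem 3 There exist constants $c, \beta$ such that for any $n \in \mathbb{N}$, $s > 0$,

$$\mathbb{P}\left[|S_n| \ge s\sqrt{n}\right] \le c\, e^{-\beta s^2}.$$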



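Using this fact,

$$\mathbb{P}[\forall\, T > t,\ W_T < -2] = \mathbb{P}\left[\forall\, T > t,\ (\log n)\,\frac{5 S_T - T}{4} < -2\right]. \quad (16)$$

For large $n$, this is the same as

$$\mathbb{P}\left[\forall\, T > t,\ S_T < \frac{T}{5}\right] \ge 1 - \sum_{T > t} c\, e^{-\beta T / 25}.$$

Clearly, there is a constant $t_0$ independent of $n$ such that $1 - \sum_{T > t_0} c\, e^{-\beta T/25} > 1 - \frac{1}{e}$. This
completes the proof. □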
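The drift and variance of the dominating walk, and its representation through a simple random walk, can be verified with exact arithmetic. The sketch below assumes the increments are $+\log n$ and $-\frac{3}{2}\log n$, each with probability $\frac{1}{2}$ (a reading of (13) and (14)), and works in units of $\log n$; the function names `increment_stats` and `walk_value` are hypothetical.

```python
from fractions import Fraction


def increment_stats(up, down, p_up):
    """Exact mean and variance of an increment equal to `up` w.p. p_up, else `down`."""
    mean = p_up * up + (1 - p_up) * down
    second_moment = p_up * up ** 2 + (1 - p_up) * down ** 2
    return mean, second_moment - mean ** 2


def walk_value(signs, up=Fraction(1), down=Fraction(-3, 2)):
    """Value of the walk (in units of log n) after steps given as +1 / -1 signs."""
    return sum(up if s > 0 else down for s in signs)


# With the assumed increments the drift is negative:
mean, var = increment_stats(Fraction(1), Fraction(-3, 2), Fraction(1, 2))

# The walk after T steps equals (5 * S_T - T) / 4 in units of log n,
# where S_T is the sum of the +1 / -1 signs.
signs = [1, -1, -1, 1, -1]
S, T = sum(signs), len(signs)
same = walk_value(signs) == Fraction(5 * S - T, 4)
```

Under these assumptions each step has mean $-\frac{1}{4}$ and variance $\frac{25}{16}$ in units of $\log n$ and $\log^2 n$ respectively, so the walk drifts to the left at a linear rate while its fluctuations grow only like $\sqrt{k}$.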



4 Acknowledgement 

I am grateful to Dimitris Achlioptas, Vivek Borkar, Stephen Boyd and Steven Lalley for 
many helpful discussions. 






References 



[1] V. S. Borkar.
Stochastic approximation with two time-scales. Systems and Control Letters, 1997.

[2] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah.
Gossip algorithms: Design, analysis and applications.
In Proceedings of the 24th Conference of the IEEE Communications Society (INFOCOM 2005), 2005.

[3] D. Bertsekas and J. Tsitsiklis.
Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.

[4] V. Konda and J. Tsitsiklis.
Convergence rate of a linear two time scale stochastic approximation. Annals of Applied Probability, 2004.

[5] S. Muthukrishnan, B. Ghosh, and M. H. Schultz.
First and Second-Order Diffusive Methods for Rapid, Coarse, Distributed Load Balancing.
Theory of Computing Systems, Volume 31, Number 4, December 1998.

[6] H. Narayanan.
Geographic Gossip on Geometric Random Graphs via Affine Combinations.
In Principles of Distributed Computing (PODC), 2007.